Abstract


Objectives:  Cleanup  of  subsurface  unexploded  ordnance  (UXO)  at  military  installations  and 
training  ranges  is  an  expensive  and  time-consuming  challenge.  While  the  goals  in  UXO 
remediation are very clear, to clean up all UXO as efficiently as possible, little effort has focused
on  designing  robust  and  efficient  signal  processing  strategies  with  the  specific  performance  goal 
dictated  by  the  regulators  in  mind.  This  project  originated  from  the  hypothesis  that  performance 
and  robustness  may  be  improved  over  the  classical  approaches  by  specifically  considering  the 
desired  operating  point  of  the  UXO  discrimination  strategy  (100%  detection)  during  the 
construction  of  each  stage  of  the  signal  processing  sequence  that  is  needed  to  make  the  “dig/no 
dig”  decision.  From  a  statistical  decision  theory  perspective,  operating  at  this  specific  point  has 
implications  that  may  impose  a  strong  preference  for  certain  processing  techniques  in  the 
UXO/clutter  discrimination  process.  The  objective  of  this  project  was  to  conduct  a  preliminary 
investigation  into  the  potential  benefit  of  awareness  of  the  specific  performance  criterion  (100% 
UXO  detection)  in  each  stage  of  the  UXO  discrimination  processing  strategy.  This  work  should 
lead  to  new  strategies  for  training  and  classification,  and  may  suggest  guidelines  for  all  stages  of 
data  processing. 

Technical  Approach:  This  project  consisted  of  several  large-scale  classification  studies  to  carefully 
analyze  the  performance  of  different  classification  algorithms  and  the  effects  of  training  data  when 
operating  at  100%  UXO  detection.  The  data  used  in  this  study  was  collected  during  the  Camp  San 
Luis  Obispo  demonstration  with  the  MetalMapper  TEM  sensor.  The  various  classification 
algorithms  included  in  this  study  provide  a  diverse  representation  of  the  different  theoretical 
approaches  to  pattern  classification,  and  allow  for  comparison  of  the  effect  of  different  classifier 
properties  on  performance  at  the  100%  detection  operating  point. 

Results:  Across  a  large  number  of  experiments,  strong  performance  was  consistently  observed  with  a 
nonparametric classification algorithm that makes decisions locally in feature space based on
neighboring  training  samples.  Such  a  classifier  shares  properties  with  the  library-matching  classifiers 
that  are  often  used  in  the  UXO  research  community  for  classification  based  on  the  polarizability 
curves.  Additionally,  preliminary  analysis  of  a  method  for  evaluating  the  outputs  of  the  model 
inversion procedure shows promise for identifying potential outliers (which drive performance at the
100%  detection  operating  point)  for  more  careful  follow-on  analysis. 

Benefits:  This  study  provides  evidence  that  the  desire  to  operate  at  100%  detection  may  lead  to  a 
preference  for  certain  algorithms  in  the  different  stages  of  the  discrimination  strategy.  Careful 
consideration and selection of methods used in each stage of the discrimination strategy may greatly
impact  performance  at  the  100%  detection  goal.  This  study  provides  preliminary  work  towards 
guidelines  for  classifier  design,  use  of  training  data,  model  inversion,  and  feature  selection  in  the 
UXO  discrimination  algorithm  that  will  eventually  lead  to  more  robust  methods  of  data 
processing. 
Objective


The  objective  of  the  work  conducted  in  support  of  MR-1715  was  to  determine  the  degree  of 
benefit  achieved  when  using  a  “performance  operating-point  aware”  approach  to  UXO 
discrimination. A series of experiments was conducted to investigate two basic research thrust
areas:  (1)  a  re-consideration  of  the  classifier  methods  and  training  set  design  for  operating  at 
100%  UXO  detection,  and  (2)  development  of  guidelines  for  the  data  pre-processing,  model 
inversion,  and  feature  selection  based  on  quantification  of  the  sensitivity  of  the  100%  UXO 
detection  operating  point  to  these  prerequisite  data  processing  stages.  This  investigation  sought 
to  assess  the  impact  of  specifically  considering  the  performance  criteria  required  by  the 
regulators  (100%  UXO  detection)  during  the  design  of  each  stage  of  the  UXO  discrimination 
processing  strategy.  From  a  statistical  decision  theory  perspective,  operating  at  PD=100%  has 
implications  that  may  impose  a  strong  preference  for  certain  processing  techniques  in  the 
UXO/clutter  discrimination  process.  The  goal  of  this  work  was  to  produce  a  proof-of-concept 
that  adopting  this  perspective  when  designing  a  UXO/clutter  discrimination  strategy  would 
suggest  favoring  or  avoiding  certain  methods  and  techniques  in  the  end-to-end  data  processing 
strategy. 

This SEED project addresses Statement of Need (SON) MMSEED-10-01. There is a
clear  need  to  effectively  and  cost-efficiently  remediate  UXO  contaminated  lands,  rendering  them 
safe  for  their  current  or  intended  civilian  uses.  Understandably,  the  UXO  regulatory  cleanup 
community  is  strongly  averse  to  the  possibility  of  leaving  behind  UXO;  it  is  a  significant  liability 
when  land  committed  to  public  use  is  later  found  to  be  contaminated.  Therefore,  to  ensure 
efficient  performance  at  the  desired  operating  point  for  the  UXO/clutter  discrimination  strategy 
(find  all  of  the  UXO  while  at  the  same  time  reducing  the  number  of  false  alarms),  it  may  be 
necessary to take the specific performance goals into consideration during the design of the end-to-end
discrimination strategy. This proposed basic research program was to serve as a proof-of-concept
to support our hypothesis that an awareness of the 100% UXO detection operating point
during  design  of  the  end-to-end  discrimination  strategy  will  lead  to  guidelines  and  preferences 
for  certain  techniques  and  algorithms  at  many  stages  of  the  overall  discrimination  strategy. 

Background 

There  are  many  areas  in  the  United  States  and  throughout  the  world  that  are  contaminated  by  or 
potentially  contaminated  by  unexploded  ordnance.  In  the  United  States  alone  there  are  1900 
Formerly  Used  Defense  Sites  (FUDS)  and  130  Base  Realignment  and  Closure  (BRAC) 
installations  that  need  to  be  cleared  of  UXO.  Using  current  technologies,  the  cost  of  identifying 
and  disposing  of  UXO  in  the  United  States  is  estimated  to  range  up  to  $500  billion.  Site  specific 
clearance  costs  vary  from  $400/acre  for  surface  UXO  to  $1.4  million/acre  for  subsurface  UXO 
[1].  These  approaches,  however,  usually  require  significant  amounts  of  human  analyst  time,  and 
thus  those  additional  costs,  which  are  currently  necessary  parts  of  ongoing  demonstrations,  are 
not factored into these numbers. Thus, there is a clear need to effectively and cost-efficiently
remediate  UXO  contaminated  lands,  rendering  them  safe  for  their  current  or  intended  civilian 
uses.  Development  of  new  UXO  detection  technologies  with  improved  data  analysis  has  been 
identified  as  a  high  priority  requirement  for  over  a  decade. 

Several  sensor  modalities  have  been  explored  for  the  detection  and  identification  of 
surface  and  buried  UXO.  These  include  electromagnetic  induction  (EMI),  magnetometry,  radar, 
and  seismic  sensors.  These  sensors  generally  experience  little  difficulty  detecting  UXO,  thus 
detection  does  not  create  the  bottleneck  that  results  in  the  high  cost  of  remediating  sites.  The 
primary  contributor  to  the  costs  and  time  associated  with  remediating  a  UXO-contaminated  site 
is  the  high  false-alarm  rate  caused  by  the  significant  amounts  of  non-UXO  clutter  and  shrapnel 
typically  found  on  battlefields  and  military  ranges.  A  significant  contributing  factor  in  the  high 
false  alarm  rate  is  the  requirement  for  high  confidence  in  the  removal  of  all  UXO  during  site 
cleanup. 

On  sites  where  anomalies  are  well  separated,  statistical  signal  processing  algorithms  that 
exploit  recent  advances  in  sensor  design  and  phenomenological  modeling  have  been  successfully 
employed  and  substantial  improvements  in  performance  over  traditional  “mag  and  flag” 
approaches  have  been  demonstrated  [2-1].  Recent  results  from  the  former  Camp  San  Luis 
Obispo and Camp Butner demonstrations clearly show that good discrimination can be
effected,  and  in  the  case  of  the  Camp  Sibert  demonstration,  all  UXO  could  be  identified  with  a 
substantial  reduction  in  the  number  of  “dig”  declarations.  Using  nomenclature  from  decision 
theory, we can refer to this operating point as the PD = 1 operating point, which corresponds to a
UXO  probability  of  detection  (PD)  equaling  unity.  This  is  the  desired  operating  point  in  the 
UXO  discrimination  scheme;  results  seen  at  recent  demonstrations  indicate  that  the  sensors  and 
algorithms  have  matured  to  the  point  that,  under  relatively  benign  test  conditions,  it  can  be 
considered  more  specifically.  The  UXO  regulator  community  is  highly  averse  to  the  possibility 
of  leaving  behind  UXO;  it  is  a  significant  liability  when  land  committed  to  public  use  is  later 
found  to  be  contaminated.  Thus,  it  is  recommended  that  the  research  community  recognize  and 
adopt  this  operating  point  as  their  standard  for  measuring  performance,  such  that  they  are  able  to 
meet  the  needs  of  the  UXO  cleanup  community.  If  the  technology  is  incapable  of  meeting 
regulators’ needs, then they will not be willing to adopt it.

Since  government  regulators  require  reliable  assurances  that  all  UXO  have  been  cleared 
from  a  site  during  cleanup  operations,  the  development  of  algorithms  for  discriminating  UXO 
from  clutter  (reducing  cleanup  expenses)  may  benefit  if  such  operating  conditions  are 
specifically  considered  during  the  design  phase.  Standard  classical  approaches  to  detection  and 
discrimination  are  not  guaranteed  to  perform  well  at  the  PD  =  1  operating  point.  The  PD  =  1 
operating  point  is  defined  by  the  last  UXO  to  be  found;  thus  performance  is  constrained  by  the 
most  “anomalous”  or  outlier  UXO  in  the  target  set.  This  perspective  could  significantly  alter  the 
framework  of  the  UXO  discrimination  strategies,  by  focusing  specifically  on  robustly  identifying 
the most difficult UXO target. From a statistical decision theory perspective, operating at the PD
= 1 point has implications that may impose a strong preference for certain processing techniques
in  the  UXO/clutter  discrimination  process.  There  are  many  stages  in  current  state-of-the-art 
UXO  discrimination  strategies,  such  as  phenomenological  model  inversion,  feature  generation 
and  selection,  and  selection  of  a  classification  algorithm.  If  each  stage  of  the  discrimination 
relies  on  standard,  typical  methods  found  in  the  research  literature  for  model  inversion,  feature 
selection,  and  classifier  training,  it  is  reasonable  to  expect  that  outliers  would  be  present  in  the 
output  of  each  stage  since  classical  approaches  are  often  designed  specifically  to  solve  the  most 
common  or  expected  scenario,  and  accept  outliers  as  part  of  the  distribution  of  possible 
observations. To maximize performance of the UXO discrimination strategy at the PD = 1
operating point, and to satisfy the government regulators, an interesting basic research
question is whether each stage should actively attempt to mitigate the occurrence of outlier UXO.
In decision theory, this distinction corresponds to the difference between error
minimization and cost minimization.

Any  UXO  discrimination  strategy  is  capable  of  two  types  of  error:  making  a  “dig” 
declaration  for  a  harmless  clutter  object  (termed  a  “false  alarm”  or  “false  positive”),  and  making 
a  “no  dig”  declaration  for  a  UXO  (termed  a  “miss”  or  “false  negative”).  As  both  of  these 
scenarios  (false  alarms  and  misses)  are  errors,  an  error-minimization  approach  would  be  equally 
motivated  to  avoid  either  of  these  events,  since  both  contribute  equally  to  the  overall  number  of 
errors.  However,  a  more  appropriate  consideration  in  UXO  cleanup  scenarios  is  cost,  rather  than 
errors.  Costs,  or  penalties,  can  be  assigned  to  the  different  combinations  of  “dig  /  no  dig” 
declaration  and  true  anomaly  class.  Since  the  cost  of  a  miss  (not  making  a  “dig”  declaration  for 
a  UXO)  far  exceeds  the  cost  of  false  alarm  (unnecessarily  making  a  “dig”  declaration  for  a 
harmless  clutter  object),  a  cost-minimizing  decision  criterion  will  shift  the  bias  towards  making 
the  least  costly  errors  (i.e.  false  alarms).  This  has  previously  been  considered  by  Carin  et  al.  [8] 
using  a  POMDP-based  approach  to  policy  implementation.  The  framework  was  able  to  consider 
various  costs,  such  as  the  cost  of  making  an  observation  with  another  sensor  or  the  cost  of 
making  either  a  “dig”  or  “no  dig”  declaration  given  the  data  available.  However,  as  will  be 
outlined  in  the  next  section,  an  alternative  approach  to  minimizing  the  number  of  dig 
declarations to reach the PD = 1 operating point may yield better performance when considering
potential drawbacks associated with the cost-minimizing approach. Rather than depending
heavily on the cost estimates, as all cost-minimization approaches such as
[8] do, our hypothesis is that we may be able to achieve better performance at the PD = 1 operating
point  by  first  analyzing  the  behavior  and  system  dynamics  of  the  UXO  discrimination  strategy, 
examining  how  outliers  are  produced  and  handled  at  various  stages  in  the  data  processing,  and 
developing  guidelines  for  processing  architecture  that  will  mitigate  such  events. 
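
The distinction between error minimization and cost minimization can be illustrated with a small sketch. The anomaly probabilities, labels, and the two cost values below are hypothetical and chosen only for illustration; they are not taken from any demonstration data.

```python
# Hypothetical anomalies: (estimated probability of UXO, true label); 1 = UXO, 0 = clutter.
ANOMALIES = [(0.95, 1), (0.80, 1), (0.15, 1), (0.60, 0), (0.30, 0), (0.20, 0), (0.05, 0)]


def count_errors(anomalies, threshold):
    """Return (false_alarms, misses) when "dig" is declared for p_uxo > threshold."""
    false_alarms = sum(1 for p, y in anomalies if p > threshold and y == 0)
    misses = sum(1 for p, y in anomalies if p <= threshold and y == 1)
    return false_alarms, misses


def total_cost(anomalies, threshold, cost_false_alarm=1.0, cost_miss=100.0):
    """Total cost of all errors; both cost values are illustrative assumptions."""
    false_alarms, misses = count_errors(anomalies, threshold)
    return false_alarms * cost_false_alarm + misses * cost_miss


for threshold in (0.5, 0.1):
    fa, miss = count_errors(ANOMALIES, threshold)
    print(threshold, "errors:", fa + miss, "cost:", total_cost(ANOMALIES, threshold))
```

In this toy example the threshold of 0.5 produces fewer total errors (one false alarm and one miss), whereas the threshold of 0.1 produces more errors (three false alarms) but a far lower total cost, because no UXO is missed.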

General  UXO  discrimination  strategy  framework 

Many of the approaches to UXO discrimination currently utilized by the research and
cleanup community can be described within the general framework presented in Figure 1. In this
general framework, there are four stages in the UXO discrimination process that occur between
the measurement of the raw sensor data and making a “dig / no dig” declaration. The first stage
is data pre-processing, which includes the selection of sensor measurements and time samples,
and application of any normalization or background correction. The second stage of processing
involves model inversion to fit the measured sensor data. Models can incorporate various
assumptions about the shape of the EMI signatures as a function of time and have different
physical or empirical motivations. Often, it is possible to make a trade-off between model rigor
and expressivity. Model inversion and numerical optimization is a significant field of research in
and of itself. Various methods can be considered to manage the presence of local minima,
high-dimensionality of the parameter space, and data quality.

Figure 1. General framework of UXO/clutter discrimination strategy, from raw data to “dig / no dig” declaration.
There are four stages: 1) data pre-processing, 2) model inversion, 3) feature generation and selection, and 4)
classification using labeled training data.

The  third  stage  of  processing  uses  the  phenomenological  model  parameters  estimated  via 
the  model  inversion  to  generate  features  for  UXO/clutter  discrimination.  Relevant  features  can 
be  generated  based  on  knowledge  regarding  the  phenomenological  models;  ideally,  physics 
suggests  that  the  features  should  correspond  to  the  intrinsic  characteristics  of  the  anomaly  in 
order  to  be  useful  for  discriminating  UXO  from  clutter.  A  large  set  of  features,  which  commonly 
occurs  for  more  advanced  multi-axis  sensors,  can  be  reduced  through  downselection  using 
feature  selection  techniques.  There  are  many  relevant  feature  selection  techniques  that  can  be 
differentiated  based  on  their  methods  for  generating  candidate  sets  of  features  and  scoring  the 
suitability  of  different  feature  combinations.  Alternatively,  some  investigators  use  the  entire 
model  fit  and  perform  library-matching  with  the  polarizability  curves.  Such  an  approach  fits 
within  the  framework  presented  in  Figure  1;  the  “features”  from  which  decision  outputs  will  be 
generated are the raw model inversion outputs, generated via a 1-to-1 mapping rather than a
more  complex  feature  generation  scheme  that  aggregates  the  raw  outputs  into  a  more  concise  set 
of  values.  The  final  stage  of  the  UXO  discrimination  strategy  is  classification.  In  this  stage,  the 
features  associated  with  an  anomaly  of  interest  are  analyzed  with  respect  to  a  set  of  features  from 
training  samples  using  a  classification  algorithm,  and  a  score  is  assigned  to  the  new  anomaly  by 
the  classifier.  Based  on  various  threshold  and  decision  criteria,  the  score  can  be  converted  into  a 
“dig  /  no  dig”  declaration.  The  design  of  the  first  two  processing  stages  benefits  from  a  high 
level  understanding  of  geophysics,  sensor  phenomenology,  and  numerical  optimization,  whereas 
the  design  of  the  final  stage  benefits  from  a  strong  background  in  statistical  signal  processing  and 
pattern recognition. Feature generation and selection requires experience in both areas to be
performed  in  an  optimal  fashion. 

Several  groups  have  successfully  implemented  various  techniques  for  each  stage  of  the 
UXO/clutter  discrimination  framework  (e.g.  [8-13]).  Data  pre-processing  can  include 
background  correction  and  sensor  position  uncertainty  modeling  (e.g.  [14,  15]).  Model  inversion 
has been performed using standard gradient descent approaches and stochastic,
evolutionary-inspired methods [16-18]. Published results have used the generalized likelihood ratio test
(GLRT),  support  vector  machine  (SVM),  and  artificial  neural  networks  (ANN)  as  classifier 
methods  in  UXO/clutter  discrimination.  These  are  standard  signal  processing  solutions  to  the 
types  of  problems  posed  by  each  stage  of  the  UXO  discrimination  strategy.  However,  it  has  not 
been  documented  whether  these  techniques  are  optimized  for  the  operating  point  required  by  the 
UXO  cleanup  task  (PD  =  1).  Therefore,  given  the  extensive  number  of  parameters  and  options 
associated  with  each  stage,  it  is  reasonable  to  suspect  that  certain  techniques  may  perform  better 
at  the  required  operating  point  than  others.  Additionally,  a  set  of  guidelines  for  use  may  be 
beneficial for improving performance at the PD = 1 operating point.

Relevant  theory 

In  this  section,  the  relevant  theory  and  concepts  that  motivate  the  methods  and  techniques 
applied  in  this  project  will  be  described  briefly.  There  is  a  relevant  body  of  work  from  the 
decision  theory,  pattern  recognition,  and  machine  learning  literature  that  can  be  leveraged  in  this 
effort  to  maximize  the  performance  of  a  UXO  discrimination  strategy  designed  to  perform  at  a 
specific  operating  point.  This  section  will  also  highlight  the  novelty  of  the  work  performed  in 
this  investigation. 

Statistical  decision  theory  offers  several  methods  for  determining  the  operating  point  of 
the  UXO/clutter  discrimination  strategy,  such  as  the  Bayesian  approach  for  calculating  the 
expected cost of each decision or the Neyman-Pearson criterion that optimizes performance at a
user-defined  number  of  false  alarms  or  missed  detections.  However,  the  drawback  of  any  of 
these  criteria  associated  with  decision  theory  is  that  they  are  applied  ex  post,  i.e.  once  the 
classifier  outputs  are  already  determined.  These  approaches  find  the  desired  operating  point  on 
the  ROC  curve,  but  do  not  explicitly  improve  performance  at  that  point.  To  improve 
performance,  the  ROC  curve  which  describes  all  possible  operating  points  for  the  classification 
system  needs  to  also  improve.  As  is  well  established,  there  can  be  substantial  variability  within 
the  ROC  curves  produced  by  a  classifier  on  a  fixed  set  of  testing  and  training  data,  simply  by 
changing  the  parameters  associated  with  most  classifiers.  Beran  and  Oldenburg  observed  such 
trends  in  comparisons  of  a  support  vector  machine,  a  probabilistic  neural  network,  and  library 
based  techniques  on  data  sets  from  two  sensors,  and  recommended  a  careful  approach  to  the 
design  of  a  classification  method  for  UXO  discrimination  [19].  In  the  UXO  discrimination  task, 
we  are  willing  to  sacrifice  performance  at  operating  points  not  of  interest  in  UXO/clutter 
discrimination to improve performance at PD = 1. Therefore, what is proposed in this research is
a systematic methodology to determine what characteristics of the classifier might produce ROC
curves with better performance at the PD = 1 operating point. Decision criteria, either cost
minimization or Neyman-Pearson, can help find that point on the ROC, but they alone cannot
improve performance at the PD = 1 operating point in the classifier design stage.
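
To make the PD = 1 operating point concrete, the following minimal sketch computes the false alarm rate at PD = 1 from a set of classifier scores. It assumes that higher scores indicate UXO and that "dig" is declared for every anomaly scoring at or above the lowest-scoring UXO; the function name and the example scores are hypothetical.

```python
def pfa_at_pd1(scores, labels):
    """False alarm rate when the threshold is lowered until every UXO is detected.

    scores: classifier outputs, higher meaning "more UXO-like" (an assumption).
    labels: 1 for UXO, 0 for clutter.
    """
    uxo_scores = [s for s, y in zip(scores, labels) if y == 1]
    clutter_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not uxo_scores or not clutter_scores:
        raise ValueError("need at least one UXO and one clutter sample")
    # The threshold is set by the last UXO to be found (the lowest-scoring UXO).
    threshold = min(uxo_scores)
    false_alarms = sum(1 for s in clutter_scores if s >= threshold)
    return false_alarms / len(clutter_scores)


labels = [1, 1, 1, 0, 0, 0]
with_outlier = [0.9, 0.8, 0.2, 0.7, 0.1, 0.3]      # one UXO scores poorly
without_outlier = [0.9, 0.8, 0.75, 0.7, 0.1, 0.3]  # same data, outlier repaired
print(pfa_at_pd1(with_outlier, labels), pfa_at_pd1(without_outlier, labels))
```

In this toy example, changing the score of a single UXO moves the false alarm rate at PD = 1 from two thirds to zero, while the scores of all other anomalies are unchanged.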

A  body  of  work  has  focused  on  the  area  of  cost-sensitive  classification.  Recognizing  the 
distinction  between  minimizing  error  and  minimizing  cost,  other  researchers  have  sought  to 
modify  error-minimizing  classifiers,  such  as  the  decision  tree  and  support  vector  machine,  to 
operate  in  cost-minimization  scenarios.  Two  dominant  approaches  have  emerged  from  this 
research  area.  The  first  method  is  termed  stratification  [20].  In  stratification,  the  set  of  training 
data  is  modified,  either  through  weighting  or  resampling  of  the  data  points,  such  that  the 
proportion  of  samples  from  each  class  is  consistent  with  the  costs.  To  implement  such  a 
technique  in  the  UXO  discrimination  task  would  require  significant  modification  to  the  design  of 
the  training  data  set.  For  the  UXO  cleanup  task,  labeled  samples  for  training  the  classifier  are 
already  of  limited  availability  and  difficult  to  collect;  thus  such  extensive  modification  to  the 
training  set  is  difficult  to  support.  Additionally,  given  the  highly  disproportionate  costs  of 
missed  UXO  and  false  alarms,  it  is  anticipated  that  the  stratification  framework  would  simply 
reject  all  clutter  samples  and  retain  only  UXO.  The  second  technique  for  cost-sensitive 
classification  is  MetaCost  [21],  a  wrapper  method  that  extends  cost-sensitive  classification  to  any 
classification  algorithm.  In  MetaCost,  the  training  samples  might  be  relabeled  according  to  their 
“cost-minimizing label” if it differs from the true training sample label. The classifier is then
retrained using the cost-minimizing class labels for the training data. These cost-minimization
based  techniques  (stratification,  MetaCost,  and  the  previously-mentioned  POMDP-based 
approach  to  policy  implementation)  may  not  yield  ideal  results  due  to  their  dependency  on  the 
estimated  costs  for  the  possible  errors.  The  elegant  simplicity  of  Bayesian  cost-minimization 
may hamper its use in UXO discrimination, since PD = 1 can only be satisfied theoretically if the
cost  of  a  missed  UXO  is  asymptotically  approaching  infinity,  and  “dig”  declarations  are  made 
for  all  anomalies.  Even  when  provided  with  more  realistic  estimates  of  cost,  the  decisions  will 
be  quite  sensitive  to  the  estimates  of  cost,  resulting  in  potentially  unstable  performance  near  the 
operating  point. 
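
The relabeling step at the core of MetaCost can be sketched as follows. This is only an illustration of the cost-minimizing label rule: the class probability is assumed to be supplied by some other estimator (MetaCost itself obtains it from a bagged ensemble, which is omitted here), and the cost table values are hypothetical.

```python
# Assumed cost table: COST[assigned_label][true_label]; values are illustrative only.
COST = {
    "uxo": {"uxo": 0.0, "clutter": 1.0},        # clutter labeled as UXO: one extra dig
    "clutter": {"uxo": 100.0, "clutter": 0.0},  # UXO labeled as clutter: a miss
}


def cost_minimizing_label(p_uxo, cost=COST):
    """Return the label with the lowest expected cost given P(UXO) for a sample."""
    expected = {
        label: cost[label]["uxo"] * p_uxo + cost[label]["clutter"] * (1.0 - p_uxo)
        for label in cost
    }
    return min(expected, key=expected.get)


print(cost_minimizing_label(0.05))   # low UXO probability, still labeled "uxo"
print(cost_minimizing_label(0.005))  # labeled "clutter"
```

With the assumed 100:1 cost ratio, any training sample whose estimated UXO probability exceeds roughly 1% is relabeled as UXO, and a different assumed ratio would move that breakpoint, which illustrates the sensitivity to cost estimates discussed above.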

A  recent  trend  in  the  development  of  pattern  classification  methods  is  the  increasing 
popularity  of  hybrid  and  meta-classifiers,  which  combine  two  or  more  component  classifiers  in 
an  overall  classification  scheme.  These  methods  have  been  shown  in  many  instances  to  provide 
improved  performance  and  robust  behavior  in  several  applications  [22-24].  There  are  various 
design  considerations  in  the  construction  of  a  meta-classifier:  the  training  data  used  for  each 
component,  the  selection  of  component  classifiers,  and  the  structure  of  the  meta-classifier. 
Training  data  can  be  sampled  randomly  using  a  technique  termed  bootstrapping,  or  specifically 
tailored  for  each  component  to  accomplish  a  specific  task  (i.e.  context-dependent  training). 
Component  classifiers  can  all  be  of  the  same  type,  such  as  a  collection  of  classification  trees  used 
in  the  Random  Forest  classifier  [25],  or  they  can  have  distinct  characteristics,  as  in  the 
generative/discriminative hybrid classifier frameworks [22-24]. Finally, the meta-classifier can
be structured as a parallel combination of components, as in the Random Forest or hierarchical
mixture of experts [26], or cascaded, as in the popular AdaBoost method [27] and several
implementations of hybrid classifiers [23, 24].
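
As a minimal sketch of a parallel meta-classifier whose components are trained on bootstrapped data, the example below combines nearest-neighbor components by majority vote. The single scalar feature, the nearest-neighbor component, the number of components, and the training values are all illustrative assumptions; this is not an implementation of Random Forest or AdaBoost.

```python
import random
from collections import Counter


def nearest_neighbor_label(train, x):
    """Label of the training sample closest to x (1-NN on a scalar feature)."""
    return min(train, key=lambda sample: abs(sample[0] - x))[1]


def bagged_predict(train, x, n_components=25, seed=0):
    """Majority vote over components, each trained on a bootstrap sample of train."""
    rng = random.Random(seed)
    votes = Counter()
    for _ in range(n_components):
        # Bootstrap: draw len(train) samples with replacement.
        bag = [rng.choice(train) for _ in train]
        votes[nearest_neighbor_label(bag, x)] += 1
    return votes.most_common(1)[0][0]


# Hypothetical, well-separated training data: (feature value, label).
TRAIN = [(0.0, "clutter"), (0.1, "clutter"), (0.2, "clutter"), (0.3, "clutter"),
         (0.4, "clutter"), (1.0, "uxo"), (1.1, "uxo"), (1.2, "uxo"),
         (1.3, "uxo"), (1.4, "uxo")]

print(bagged_predict(TRAIN, 0.15), bagged_predict(TRAIN, 1.25))
```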

Materials  and  Methods 

The  experiments  conducted  in  this  SEED  effort  relied  extensively  on  data  collected  as  part  of  the 
Former  Camp  San  Luis  Obispo  demonstration.  Data  and  model  fits  were  provided  by  Snyder 
Geoscience for the MetalMapper sensors with the MM/RMP model [28]. This model performs
sequential  stages  of  estimation:  first,  nonlinear  estimation  to  determine  position  and  the 
symmetric  matrix  of  polarizability  transients,  then  linear  estimation  of  the  three  attitude  angles 
and  three  principal  polarizability  transients,  followed  by  parametric  curve-fitting.  A  total  of 
1072  anomalies  (887  clutter,  185  UXO)  were  taken  from  the  cued  identification  survey  and  used 
for the discrimination experiments in this study. From the set of available model parameters, a
subset of nine features was selected; thus, the data could be arranged as a 1072-by-9
matrix. The motivation for manual selection of a feature subset rather than using empirical
results  from  feature  selection  algorithms  was  to  reduce  the  opportunity  for  bias  towards  one  of 
the  classification  algorithms  that  was  to  be  evaluated  in  this  study.  The  most  common  methods 
for  feature  selection  (either  information-theoretic  filter  methods  or  classifier-dependent  wrapper 
methods  [29,  30])  will  assume  an  underlying  model  for  the  distribution  of  anomalies  in  each 
class.  Therefore,  the  feature  subset  was  selected  based  on  descriptions  of  the  physical  target 
characteristics  that  are  represented  by  each  feature  and  their  likelihood  for  representing  intrinsic 
characteristics  of  the  anomaly.  Based  on  this  analysis,  the  following  nine  features  were  selected: 

•  NormP0, NormP1: RMS values of P0X,P0Y,P0Z and P1X,P1Y,P1Z, which represent the
numerical  integration  of  the  principal  polarizability  curves  and  their  first  moments, 
respectively. 

•  P1X,  P1Y,  P1Z:  Numerical  integration  of  the  first  moment  of  the  principal  polarizability 
curves. 

•  P0_R,  P0_E:  Measure  of  aspect  ratio  and  eccentricity  based  on  principal  polarizability 
curves. 

•  P1_R,  P1_E:  Measure  of  aspect  ratio  and  eccentricity  based  on  first  moment  of  the 
principal  polarizability  curves. 
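
As a small illustration of the first pair of features, the sketch below computes a root-mean-square value over three axis components. It assumes that "RMS" carries its usual meaning (square root of the mean of the squares); the numerical values are hypothetical, and the exact definitions used in the MM/RMP model outputs may differ.

```python
import math


def rms(values):
    """Root-mean-square of a sequence of numbers."""
    values = list(values)
    return math.sqrt(sum(v * v for v in values) / len(values))


# Hypothetical integrated principal polarizabilities for one anomaly.
p0 = {"P0X": 3.0, "P0Y": 4.0, "P0Z": 0.0}
norm_p0 = rms(p0.values())  # analogous to NormP0 under the stated assumption
print(norm_p0)
```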

Through follow-up communications with Skip Snyder of Snyder Geoscience it was revealed that
six of the nine features match up with the feature set they used at the recent Camp Butner
demonstration (footnote 1). To illustrate the separation of UXO and clutter objects using the nine features
selected for use in this study, a two-dimensional principal component projection of the feature
set is shown in Figure 2. The lower-dimensional subspace reveals strong clustering of the UXO,
with greater variability observed in the clutter features.

1  Snyder used P1_T, equal to the average of P1Y and P1Z, in addition to P1X, P0_E, P0_R, P1_E, and P1_R.

Figure 2. Representation of the nine-dimensional feature space in two dimensions using principal components
analysis. The x- and y-axis correspond to the scores along the first and second principal components, respectively.
This  SEED  project  relied  extensively  on  the  use  of  large-scale  Monte  Carlo  simulations 
to  thoroughly  analyze  the  properties  and  performance  of  the  different  algorithms  and  stages  in 
the  UXO  discrimination  scheme.  Large  sets  of  independent  classification  results  were  generated 
by randomly sampling from the available set of 1072 anomalies. Two methods of sampling
were  employed  in  this  study.  The  primary  means  of  constructing  subsets  of  data  was  by 
sampling  without  replacement  to  divide  the  1072  anomalies  into  two  data  sets  (one  for  classifier 
training  and  a  second  for  validation).  Alternatively,  for  certain  experiments  it  was  desired  to 
measure the distribution of a measure or statistic (e.g. Pfa at PD = 1). In these instances,
bootstrapping  (sampling  with  replacement)  was  utilized  to  estimate  the  underlying  probability 
distribution  for  the  measure  of  interest. 
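
The two sampling schemes can be illustrated with a minimal Python sketch. It is not the code used in the study; the function names, the fixed seed, and the test-set size of 100 (borrowed from the first classifier experiment) are illustrative assumptions.

```python
import random

def split_without_replacement(n_items, n_test, rng):
    """Randomly partition indices 0..n_items-1 into disjoint training and test sets."""
    idx = list(range(n_items))
    rng.shuffle(idx)
    return idx[n_test:], idx[:n_test]

def bootstrap_sample(n_items, rng):
    """Draw n_items indices with replacement (one bootstrap replicate)."""
    return [rng.randrange(n_items) for _ in range(n_items)]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# One Monte Carlo split of the 1072 anomalies: no anomaly appears in both sets.
train_idx, test_idx = split_without_replacement(1072, 100, rng)

# One bootstrap replicate: the same anomaly may be drawn more than once.
boot_idx = bootstrap_sample(1072, rng)
```

Repeating either call many times yields the collection of data sets over which a statistic such as Pfa at Pd = 1 can be accumulated.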

One  of  the  main  focuses  of  this  SEED  study  was  the  statistical  classification  algorithm 
that  is  used  in  the  final  stages  of  the  UXO/clutter  discrimination  strategy.  There  are  many 
possible  choices  for  algorithms  to  use  in  this  stage,  and  several  experiments  examined  the 
various properties of different classifiers to evaluate their potential for operating at Pd = 1.
Table  1  lists  the  classifiers  considered  in  this  study,  along  with  several  significant  properties. 
The  third  column  of  the  table  indicates  whether  the  classifier  is  generative  or  discriminative. 
Generative classifiers attempt to model the distributions of each class (i.e. UXO and
clutter), and use that information to make classification decisions. Discriminative classifiers do
not attempt to learn the class-specific feature distributions and instead only attempt to learn the
decision boundary separating the two classes. Thus, the discriminative classifier is often viewed as
having a simpler learning task. The second classifier property is local versus aggregate use of
training data. Local classifiers make a classification decision based only on the neighboring
training samples, whereas aggregate classifiers rely on parameters that are calculated from all of
the available training data. A simple test for whether a classifier uses “local” or “aggregate”
training data is to ask whether the classifier’s output for some test sample
x_test would be sensitive to the addition of a large amount of new training data at a point in
feature space not near x_test. If the classifier’s output is not affected by the addition of these new
training samples, the classifier makes “local” decisions. The final classifier property specified in
the table is parametric versus nonparametric. Parametric classifiers make use of a
model to condense the information in the training data to a finite number of parameters, whereas
a nonparametric classifier preserves the entire set of training data for making decisions on test
data. Thus, the storage requirements increase for nonparametric classifiers as more training data
is acquired.
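
The “local” versus “aggregate” test described above can be sketched in Python with two toy one-dimensional classifiers. Both classifiers, the feature values, and the location of the added training points are hypothetical stand-ins chosen for illustration; they are not the algorithms listed in Table 1.

```python
def knn_score(x, train, k=3):
    """Local: fraction of UXO labels among the k training samples nearest to x."""
    nearest = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return sum(label for _, label in nearest) / k

def mean_diff_score(x, train):
    """Aggregate: distance to the clutter mean minus distance to the UXO mean.
    Both means are computed from ALL training samples of each class."""
    uxo = [f for f, label in train if label == 1]
    clutter = [f for f, label in train if label == 0]
    mu1 = sum(uxo) / len(uxo)
    mu0 = sum(clutter) / len(clutter)
    return abs(x - mu0) - abs(x - mu1)

# Toy training data: (feature value, label), with 1 = UXO and 0 = clutter.
train = [(0.0, 0), (0.2, 0), (0.4, 0), (1.0, 1), (1.2, 1), (1.4, 1)]
far_away = [(50.0, 0)] * 100   # many new clutter samples far from x_test
x_test = 1.1

knn_before = knn_score(x_test, train)
knn_after = knn_score(x_test, train + far_away)
agg_before = mean_diff_score(x_test, train)
agg_after = mean_diff_score(x_test, train + far_away)
```

In this sketch the nearest-neighbor output is unchanged by the distant samples, whereas the mean-based output shifts because the clutter mean moves; by the test above, the former is “local” and the latter “aggregate”.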


Table 1. List of classifier algorithms and relevant properties.

Classifier                         Acronym  Generative /    Local /    Parametric /   Reference
                                            Discriminative  Aggregate  Nonparametric
Generalized Likelihood Ratio Test  GLRT     Generative      Aggregate  Parametric     [31]
Distance Likelihood Ratio Test     DLRT     Generative      Local      Nonparametric  [32]
K Nearest Neighbor                 KNN      Generative      Local      Nonparametric  [33]
Linear Discriminant                FLD      Discriminative  Aggregate  Parametric     [33]
Support Vector Machine             SVM      Discriminative  Aggregate  Parametric     [26, 34]
Relevance Vector Machine           RVM      Discriminative  Aggregate  Parametric     [35]
Random Forest                      RF       Discriminative  Aggregate  Nonparametric  [25]
Artificial Neural Network          ANN      Discriminative  Aggregate  Parametric     [33]

Results and Discussion


Task  #1:  Classifier  design 

The first experiments in Task #1 focused on assessment of the performance of the different
classifiers (listed in Table 1) at Pd = 1. The performance of these classifiers was analyzed based
on performance statistics for 1000 data sets generated using the Monte Carlo
sample-without-replacement experiment design. Each of the 1000 test sets consisted of 100 randomly selected
anomalies; the remaining 972 anomalies were used for training data. The Pfa at Pd = 1 was
calculated from the resulting ROC curve for each data set and each classifier. Figure 3 shows the
distribution of Pfa values over the 1000 data sets. In these box plots, the red line indicates the
median value, the edges of the box correspond to the 25th and 75th percentiles, and the whiskers
show the extent of the remaining data points (excluding outliers, shown as red markers, which
are above the 75th percentile or below the 25th percentile by more than 1.5 times the interquartile
range). For the DLRT and Random Forest classifiers, most of the results fall among
the lowest Pfa values. The FLD and SVM classifiers were able to achieve low Pfa for many of
the data sets, but also have a large proportion of instances where Pfa is quite high (greater than
0.8). The GLRT classifier has a similarly large range of Pfa values, and seems to be quite
dependent on the specific training/test data set.
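
The performance metric used throughout, Pfa at Pd = 1, can be sketched in Python as follows. The function name, the convention that higher scores are more UXO-like, and the treatment of ties (a clutter item scoring exactly at the threshold counts as a false alarm) are assumptions made for illustration.

```python
def pfa_at_pd1(scores, labels):
    """False alarm rate when the threshold is set just low enough to detect every UXO.

    scores: classifier decision metrics, higher meaning more UXO-like (assumed).
    labels: 1 for UXO, 0 for clutter.
    """
    uxo = [s for s, y in zip(scores, labels) if y == 1]
    clutter = [s for s, y in zip(scores, labels) if y == 0]
    threshold = min(uxo)  # the lowest-scoring UXO fixes the operating point
    false_alarms = sum(1 for s in clutter if s >= threshold)
    return false_alarms / len(clutter)

# Toy example: the weakest UXO scores 0.35, so clutter at 0.6 and 0.4 must also be dug.
scores = [0.9, 0.8, 0.35, 0.6, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0, 0]
pfa = pfa_at_pd1(scores, labels)  # 2 of 4 clutter items -> 0.5
```

Because the threshold is set by the single lowest-scoring UXO, one UXO outlier is enough to drive this metric toward 1, which is consistent with the large spread seen for some classifiers.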


Figure 3. Boxplot showing distribution of Pfa at Pd = 1 over 1000 sets of training/test data for six classifiers
(x-axis, left to right: RVM, DLRT, GLRT, FLD, SVM, RF). Red lines identify the median of the distribution and
the edges of the blue box extend to the 25th and 75th percentiles. The red hash marks identify outliers that are
beyond the 25th and 75th percentiles by more than 1.5 times the interquartile distance.
Figure 4. Performance rankings for each classifier algorithm over all 1000 training/test sets corresponding to the
results presented in Figure 3. Each column in each subplot indicates the number of instances (out of 1000
possible) that the classifier had the lowest Pfa (rank of 1) through highest Pfa (rank of 6).


An  alternative  view  of  these  experiments  was  generated  by  ranking  the  Pfa  values  for  the 
six  classification  algorithms  (from  1  to  6,  with  1  corresponding  to  the  lowest  Pfa  value)  for  each 
of  the  1000  test  cases.  A  histogram  of  the  rankings  for  each  classifier  is  shown  in  Figure  4.  In 
this analysis based on the ranking of each classifier (in terms of Pfa), the DLRT and RVM
classifiers appear to have the best performance; i.e. for most data sets they outperform the other
four classifiers and provide either the lowest or second-lowest Pfa.
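
The ranking analysis can be sketched in Python as below. The classifier names "A", "B", "C" and the Pfa values are made up for illustration, and breaking ties by name order is an assumption; the report does not state how ties were handled.

```python
def rank_classifiers(pfa_by_classifier):
    """Rank classifiers within one trial: 1 = lowest Pfa.
    Ties are broken by name order (an assumption)."""
    ordered = sorted(pfa_by_classifier,
                     key=lambda name: (pfa_by_classifier[name], name))
    return {name: rank for rank, name in enumerate(ordered, start=1)}

def rank_histogram(trials):
    """For each classifier, count how often it achieved each rank across trials."""
    names = sorted(trials[0])
    counts = {name: [0] * len(names) for name in names}
    for trial in trials:
        for name, rank in rank_classifiers(trial).items():
            counts[name][rank - 1] += 1
    return counts

# Illustrative Pfa-at-Pd=1 values for three hypothetical classifiers over three trials.
trials = [
    {"A": 0.10, "B": 0.30, "C": 0.20},
    {"A": 0.05, "B": 0.50, "C": 0.40},
    {"A": 0.60, "B": 0.20, "C": 0.30},
]
hist = rank_histogram(trials)  # hist["A"][0] is the number of trials where A ranked first
```

A rank histogram discards the magnitude of the Pfa differences, so it complements rather than replaces the box plots: a classifier can rank first often and still have occasional very poor trials.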

A  second  experiment  was  set  up  to  examine  the  propensity  of  a  classifier  to  declare  UXO 
at  points  in  feature  space  where  it  has  previously  observed  clutter  in  the  training  dataset.  This 
classifier  property  may  be  necessary  to  extend  the  decision  boundary  to  encompass  enough  of  the 
feature  space  to  operate  at  Pd  =1.  In  this  experiment,  each  classifier  was  trained  on  all  available 
data,  except  for  three  randomly-selected  UXO  held  out  for  use  in  the  test  set.  In  the  place  of 
these  three  UXO,  three  observations  labeled  as  “clutter”  were  inserted  into  the  training  data  with 
the  same  feature  values  as  the  held-out  UXO.  In  the  test  stage,  the  classifier  was  evaluated  on 
the  three  held-out  UXO  (with  feature  values  matching  the  “clutter”-labeled  observations  inserted 
into  the  training  data)  and  the  clutter  from  the  training  set.  The  performance  metric  calculated 
for each of the 1000 iterations in this experiment was Pfa with 100% detection of the three held-out
UXO. The rankings (by lowest Pfa) across the 1000 iterations are shown in Figure 5. The
DLRT classifier was capable of detecting the three UXO with the lowest Pfa for most datasets.
This suggests that the DLRT is the classifier most capable of ignoring a few clutter items at the
periphery of the main UXO cluster in feature space to find UXO that are mild outliers.
Another comparison was made to examine the similarity of the decisions produced by
each  classifier.  If  two  classifiers  produce  decisions  that  are  not  highly  correlated,  then  the 
combination  of  the  classifiers  may  result  in  a  system  that  has  greater  information  available  for 
detecting  UXO,  leading  to  higher  performance  with  the  fusion-classifier  approach.  In  this 
experiment,  a  test  set  was  constructed  by  randomly  drawing  10  UXO  and  10  clutter  items.  This 
test  set  was  held  fixed  for  100  iterations  of  the  classification  study  using  training  data  that 
consisted  of  150  anomalies  from  each  class  (300  in  total)  drawn  at  random.  Thus,  a  matrix  of 
decision  metrics  with  dimensions  (nClassifiers  x  100)  by  20  could  be  constructed  (where 
nClassifiers  is  the  number  of  classifier  algorithms  considered  in  the  experiment).  The  decision 
metrics  were  normalized  within  each  row  of  the  matrix  to  be  unit-variance  and  zero  mean,  and 
the  distances  between  the  decision  outputs  for  each  iteration  (i.e.  each  different  training  set)  and 
each  classifier  were  calculated.  This  distance  matrix  can  then  be  decomposed  using  principal 
component analysis and represented in a two-dimensional space. This technique is known as
multidimensional  scaling  [33],  which  attempts  to  represent  a  data  set  in  a  lower-dimensional 
space  while  maintaining  the  relative  distances  between  points  in  the  higher-dimensional  space. 
If  any  individual  classifier  is  not  affected  by  the  variability  in  the  training  set  over  the  100 
iterations,  then  the  decision  metrics  would  be  identical  for  that  classifier  and  the  calculated 
distances  would  be  zero,  thus  all  100  points  for  that  classifier  would  occur  at  a  single  point  in  the 
multidimensional  scaling  subspace. 
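
The preprocessing that feeds the multidimensional scaling step (row normalization followed by pairwise distances) can be sketched in Python as follows. The use of the population variance and of Euclidean distance are assumptions, since the report does not specify either, and the final projection to two dimensions is not shown.

```python
import math

def standardize(row):
    """Scale one row of decision metrics to zero mean and unit variance.
    Uses the population variance (an assumption); a constant row would divide by zero."""
    n = len(row)
    mean = sum(row) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in row) / n)
    return [(v - mean) / std for v in row]

def distance_matrix(rows):
    """Euclidean distances between all pairs of standardized rows."""
    z = [standardize(r) for r in rows]
    return [[math.dist(a, b) for b in z] for a in z]

# Each row holds the decision metrics of one (classifier, training set) pair
# on the same fixed test items; the values are illustrative.
rows = [
    [0.1, 0.4, 0.9, 0.2],
    [1.0, 4.0, 9.0, 2.0],   # same decision pattern as row 0 on a different scale
    [0.9, 0.2, 0.1, 0.4],   # a different ordering of the test items
]
d = distance_matrix(rows)
```

After normalization the first two rows are separated by a distance of essentially zero even though their raw scales differ, which is why classifiers with different output ranges can be compared in the same plot.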
Figure 5. Performance rankings for each classifier algorithm over all 1000 training/test sets in the experiment
testing classifier propensity to declare UXO. Each column in each subplot indicates the number of instances (out of
1000 possible) that the classifier had the lowest Pfa (rank of 1) through highest Pfa (rank of 6).

Figure 6. Multidimensional scaling representation of the classifier algorithm outputs for five distinct test sets
(corresponding  to  the  five  subplots).  Within  each  subplot,  each  point  represents  the  set  of  decision  metrics  that 
result  when  using  a  unique  set  of  training  data.  Different  marker  types  correspond  to  the  different  algorithms. 
Points  that  are  close  in  proximity  represent  similar  decision  patterns,  whereas  points  located  far  apart  represent 
dissimilar  classifier  decision  outputs. 

The multidimensional scaling analysis was run for five different randomly selected test
data  sets.  Each  two-dimensional  projection  is  shown  in  Figure  6.  Across  the  five  test  sets,  there 
are  some  common  trends.  The  DLRT  and  SVM  tend  to  form  fairly  tight  groups;  thus,  their 
decision  metrics  are  fairly  consistent  despite  the  variability  across  the  100  instances  of  the 
training  data  set.  The  Random  Forest  and  SVM  classifiers  are  in  close  proximity  in  the  MDS 
space  suggesting  that  these  two  classifiers  produce  similar  decision  outputs.  The  MAP  classifier 
and  the  DLRT  often  appear  on  opposite  sides  of  the  MDS  subspace,  which  indicates  that  these 
two  classifiers  may  produce  some  of  the  most  different  decision  outputs.  The  plot  also  reveals 
that  the  MAP  classifiers  are  particularly  sensitive  to  the  choice  of  training  samples,  based  on  the 
spread  of  the  points  for  these  classifiers. 

Another element in Task #1 was assessment of the effects of training data. The first
experiment, which performed a baseline evaluation of performance (Pfa at Pd = 1), was re-run
using significantly smaller training sets. In this experiment, 100 anomalies were randomly
selected for use in training (with a minimum of 20 UXO) and the remaining 972 anomalies were
used as the test set. A total of 1000 iterations were run in this experiment. The distribution of
Pfa at Pd = 1 is represented in Figure 7 as a boxplot. In this scenario, the significantly reduced
size of the training data set results in much greater variability in Pfa at Pd = 1. However, the
overall pattern of performance is little changed; the DLRT is consistently the best-performing
classifier in this reduced training data scenario.

Figure 7. Boxplot showing distribution of Pfa at Pd = 1 over 1000 sets of training/test data for six classifiers when
the amount of available training data is significantly reduced (100 anomalies, at least 20 of which are UXO). Red
lines identify the median of the distribution and the edges of the blue box extend to the 25th and 75th percentiles.
The red hash marks identify outliers that are beyond the 25th and 75th percentiles by more than 1.5 times the
interquartile distance.

Figure 8. Illustration of the variability in the synthetic data set introduced by increasing the noise σn.

Another consideration is the effect of mismatch between the training and test data. An
experiment was run using synthetic data in which mismatch between the training and test data was
introduced by adding Gaussian noise to the test set features. Examples of the test data for three
different noise levels are shown in Figure 8. In this experiment, the test data consisted of 1000
anomalies equally split between H1 and H0 samples. Three training set sizes were investigated
(N = 200, 400, and 1000) with equal representation of H1 and H0. A total of 800 iterations were
run to generate distributions of Pfa at Pd = 1.

The distribution of Pfa at Pd = 1 for each classifier was represented using two statistics:
the average and standard deviation of Pfa over the 800 iterations. These values can be
represented as a point in a 2-D plot, as shown in Figure 9 (left) for all classifiers. The multiple
points connected by a line represent the increasing mismatch between the training and test sets
due to increasing σn. The ideal performance point is at (0,0) in the lower left corner. When both
performance and predictability are considered, the DLRTi and GMM GLRT appear to provide
the best results. It is worth noting that the results are fairly sensitive to some classifier
parameters. The DLRT classifier was run with three values of K, the sole user-specified
classifier parameter, which determines the size of the neighborhood used to classify a new test
sample. This sensitivity is further highlighted in Figure 9 (right), which shows Pfa at Pd = 1 for an
extensive range of values for K, measured using 10-fold cross-validation on the entire 1072
anomaly data set. Figure 9 (right) reveals that while the DLRT classifier is capable of performing
well at Pd = 1, it is sensitive to the parameter K.
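
The structure of the mismatch experiment can be sketched in Python with a deliberately simple stand-in. Everything specific here is an assumption made for illustration: one-dimensional Gaussian classes centered at 0 (H0) and 4 (H1), a Gaussian likelihood-ratio score fitted to clean training data, 50 samples per class, 200 iterations, and the three noise levels. None of these values are taken from the study.

```python
import math
import random
import statistics

def pfa_at_pd1(scores, labels):
    """False alarm rate with the threshold at the lowest-scoring H1 (UXO) sample."""
    uxo = [s for s, y in zip(scores, labels) if y == 1]
    clutter = [s for s, y in zip(scores, labels) if y == 0]
    threshold = min(uxo)
    return sum(1 for s in clutter if s >= threshold) / len(clutter)

def gaussian_llr(x, mu0, sd0, mu1, sd1):
    """log p(x|H1) - log p(x|H0) for one-dimensional Gaussian class models."""
    return (math.log(sd0 / sd1)
            + (x - mu0) ** 2 / (2 * sd0 ** 2)
            - (x - mu1) ** 2 / (2 * sd1 ** 2))

def run_trial(sigma_n, rng, n_per_class=50):
    """Train on clean data, test on data with extra N(0, sigma_n) noise added."""
    train0 = [rng.gauss(0.0, 1.0) for _ in range(n_per_class)]
    train1 = [rng.gauss(4.0, 1.0) for _ in range(n_per_class)]
    mu0, sd0 = statistics.fmean(train0), statistics.stdev(train0)
    mu1, sd1 = statistics.fmean(train1), statistics.stdev(train1)
    scores, labels = [], []
    for label, center in ((0, 0.0), (1, 4.0)):
        for _ in range(n_per_class):
            x = rng.gauss(center, 1.0) + rng.gauss(0.0, sigma_n)  # mismatch noise
            scores.append(gaussian_llr(x, mu0, sd0, mu1, sd1))
            labels.append(label)
    return pfa_at_pd1(scores, labels)

def summarize(sigma_n, n_iter=200, seed=0):
    """Mean and standard deviation of Pfa at Pd = 1 over repeated trials."""
    rng = random.Random(seed)
    values = [run_trial(sigma_n, rng) for _ in range(n_iter)]
    return statistics.fmean(values), statistics.pstdev(values)

summary = {sigma_n: summarize(sigma_n) for sigma_n in (0.0, 1.0, 2.0)}
```

Each entry of `summary` is one (mean, standard deviation) pair, i.e. one point on a curve of the kind plotted in Figure 9 (left); joining the points in order of increasing noise traces the effect of growing mismatch for this toy classifier.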

Figure 9. Left: Effects of mismatch in the training and test data, represented by the mean Pfa at Pd = 1 (y-axis) and
standard deviation of Pfa at Pd = 1 (x-axis). Pfa at Pd = 1 statistics are calculated from 800 training/test data sets, and
each point represents a distinct mismatch condition (i.e. amount of noise added to produce mismatch between
training and test data). Right: Effect of DLRT parameter K on Pfa at Pd = 1, as measured based on 10-fold
cross-validation using the baseline data.


In a similar format to Figure 9, Figure 10 shows the results for the same training set
mismatch study using training sets with two larger sizes: N = 500 (left) and N = 1000 (right).
One trend that can be observed as the size of the training data increases is the reduction in
variability of the results for the DLRT classifier, suggested by the leftward shift of the three lines
representing the DLRT. Thus, the standard deviation of Pfa at Pd = 1 for these classifiers
decreases as the amount of training data increases. This result is supported by previous studies
that have shown the benefit of larger training set sizes for nonparametric classifiers (e.g. [36]).

Task  #2:  Sensitivity  to  feature  selection  and  model  inversion 

In  Task  #2,  the  focus  of  the  project  shifted  to  the  prerequisite  stages  in  the  UXO  discrimination 
strategy:  model  inversion  and  feature  selection.  Two  studies  of  feature  selection  were  motivated 
by  questions  about  the  impact  of  the  size  of  the  feature  set  on  performance  at  Pd  =  1  as  well  as 
the  most  appropriate  performance  measure  for  feature  selection  based  on  empirical  feature 
selection  methods. 

In the first study, feature sets of various sizes were considered with four different
classifiers. The features were organized into three characteristic groups: two time features
(Tm), four size features (Sz), and six shape features (Sh). The features in this experiment were
selected after consultation with Skip Snyder at Snyder Geoscience. The feature groups were
evaluated individually and in combination, which produced a 12-feature superset in the latter
case. The experiment used 10 realizations of 10-fold cross-validation, producing 100 calculations
of Pfa at Pd = 1. Figure 11 shows the distribution of Pfa at Pd = 1 for the individual,
paired, and complete feature sets for each of the four classifiers. In the subplots, the number of
features used for classification increases from the bottom (Tm) to the top (Tm+Sh+Sz). Based
on inspection of the median of the Pfa distributions (i.e. the red line in each boxplot) for each
feature set, there does not appear to be a strong trend in performance as a function of feature set.
This result is not consistent with conclusions from an earlier SERDP-sponsored project
(MM-1442), where it was observed that larger feature sets provided better performance. One possible
explanation for the inconsistency is that the previous study used empirical feature selection
techniques to sequentially determine the feature sets.

Figure 10. Effects of mismatch between training and test data, expressed using statistics of distribution of Pfa at Pd
= 1, when the training set size is increased to N = 500 (left) and N = 1000 (right).

A  second  study  examined  various  possible  performance  measures  for  empirical 
evaluation  of  a  feature  set  /  classifier  combination.  The  most  obvious  measure  is  Pfa  at  Pd  =1 
since this is the desired operating point for the designed system. However, it may also be useful
to measure Pfa at Pd = (1 - 1/N_UXO) and Pd = (1 - 2/N_UXO), where N_UXO is the number of UXO in
the test set. These are performance measures that correspond to nearly perfect UXO detection
by  our  system.  The  area  under  the  ROC  curve  (AUC)  is  a  commonly  used  measure,  as  ROC 
curves  have  often  been  used  to  present  the  results  of  UXO  discrimination  studies.  More  relevant 
to  our  desired  operating  point  is  AUC  for  PD  >0.95,  which  calculates  just  the  area  under  the 
upper  part  of  the  ROC  curve.  Finally,  the  equal  error  rate  (EER)  was  also  included  in  this 
investigation. The equal error rate is the value ε such that (1 - Pd) = Pfa = ε, and it is a
commonly-used  performance  measure  in  other  fields. 
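
Several of these measures can be computed from one empirical ROC curve, as in the minimal Python sketch below. The function names are hypothetical, the target Pd in `pfa_at_pd` is a free parameter (illustrated here with one missed UXO), and the EER is approximated by the ROC point at which (1 - Pd) and Pfa are closest. The partial AUC for Pd > 0.95 is omitted because its exact definition is not given in the text.

```python
def roc_curve(scores, labels):
    """Empirical ROC points (Pfa, Pd), starting at (0, 0), one per distinct threshold."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pairs = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        while i < len(pairs) and pairs[i][0] == s:  # tied scores share one threshold
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points

def pfa_at_pd(points, pd_target):
    """Smallest Pfa among ROC points whose Pd reaches pd_target."""
    return min(pfa for pfa, pd in points if pd >= pd_target - 1e-12)

def auc(points):
    """Trapezoidal area under the full ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

def equal_error_rate(points):
    """Approximate EER: average of (1 - Pd) and Pfa at the point where they are closest."""
    pfa, pd = min(points, key=lambda p: abs((1 - p[1]) - p[0]))
    return ((1 - pd) + pfa) / 2

# Toy data: 4 UXO (label 1) and 4 clutter (label 0); higher score = more UXO-like.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
n_uxo = sum(labels)

points = roc_curve(scores, labels)
pfa_all = pfa_at_pd(points, 1.0)                  # every UXO detected
pfa_miss_one = pfa_at_pd(points, 1 - 1 / n_uxo)   # one UXO allowed to be missed
area = auc(points)
eer = equal_error_rate(points)
```

In this toy example, allowing a single missed UXO halves the false alarm rate (from 0.5 to 0.25), which illustrates how strongly the measure can depend on the exact operating point.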

In this experiment, the sampling-without-replacement method was used to generate test
and validation data sets, each with 50 anomalies. The remaining 972 anomalies were used as
training data in the DLRT classifier. The experiment was repeated 1000 times, with each of the
performance measures described above calculated for both the test and validation set. The
results generated a 1000 by 12 matrix (six performance measures for both test and validation
data sets). Figure 12 shows the correlation matrix, i.e., the correlation between the performance
measures over the 1000 iterations of the experiment. The distinct blocks of pixels in the upper
left and lower right of the image correspond to correlations of measures within the same data set.
The blocks of pixels in the upper right and lower left correspond to correlations of performance
measures across the two data sets.

Figure 11. Boxplots showing distribution of Pfa at Pd = 1 using seven feature sets of increasing dimensionality with
four different classifiers. Pfa distributions were calculated from 100 training/test data sets (10 repeats of 10-fold
cross-validation). Red lines identify the median of the distribution and the edges of the blue box extend to the 25th
and 75th percentiles. The red hash marks identify outliers that are beyond the 25th and 75th percentiles by more than
1.5 times the interquartile distance.

Within a set, the AUC for Pd > 0.95 is most correlated with our desired operating point
(Pfa at Pd = 1). Somewhat surprisingly, there is very little correlation between Pfa at Pd = 1 and
Pfa at the other Pd levels considered: Pd = (1 - 1/N_UXO) and Pd = (1 - 2/N_UXO). Thus, a
feature set/classifier combination that is assessed as performing well by one measure may not be
assessed as such if an incrementally lower Pd is considered for the system. Looking to the
between-set results, there is very little correlation between any of the measures calculated on the
separate data sets. Ideally, high correlation between Pfa at Pd = 1 in Set 2 and any performance
measure in Set 1 would indicate good predictive power for estimating future performance on new
data. However, the performance measures evaluated in this study appear to provide very little
information across data sets.

Figure 12. Matrix of correlations between various performance measures calculated within and across two
independent test sets.

The  model  inversion  stage  of  the  UXO  discrimination  strategy  is  one  place  where 
mitigating  the  occurrence  of  UXO  outliers  is  critical  to  improving  performance  at  Pd=1.  An 
experiment  was  performed  to  investigate  a  potential  technique  for  identifying  outliers  generated 
by  the  model  inversion.  This  technique  is  based  on  the  hypothesis  that  the  model  inversion 
process is fairly robust and consistent. If two data files X1 and X2 are very similar, even
identical,  then  the  corresponding  feature  sets  that  result  from  model  inversion  should  also  be 
very  similar.  Note  that  the  inverse  relationship  does  not  need  to  hold  true:  due  to  extrinsic 
parameters  in  the  model  and  different  object  orientations,  two  UXO  with  very  different 
measurements  may  correctly  have  very  similar  feature  representations. 

To  implement  this  outlier  measure,  two  matrices  of  correlation  coefficients  (RD  and  RF) 
were  calculated.  The  (ith,  jth)  entry  in  RD  contains  the  correlation  between  the  measurements  for 
the  ith  and  jth  anomaly.  Similarly,  the  (ith,  jth)  entry  in  RF  contains  the  correlation  between  the 
feature  representations  of  the  ith  and  jth  anomaly  (thus,  both  matrices  are  symmetric).  The  outlier 
measure  for  the  ith  anomaly  is  calculated  based  on  the  following  equation: 

j 

where 1{·} is the indicator function, equal to one if RD(i,j) > RF(i,j) and zero otherwise
(MATLAB code available in the Appendices). The outlier measure has the effect of looking for
dissimilarities between clusters in data space and clusters in feature space. When the outlier
metric A(i) is large, it indicates that the ith anomaly is dissimilar (i.e. low correlation RF(i,j)) to
a number of anomalies in feature space to which it was more similar (i.e. higher correlation
RD(i,j)) in data space.

Figure 13. Left: Scatter plot of the outlier identification metric A for each anomaly. Marker color and style
differentiates UXO and clutter. Higher values of the outlier identification metric indicate a greater likelihood that an
anomaly is an outlier. Right: Scatter plot of the features in a two-dimensional space (i.e. Figure 2) with potential
outliers identified via the outlier identification metric. Red points identify anomalies with an outlier identification
metric greater than 0.1 (6% of the total anomaly set). A single UXO is identified as an outlier, and it is in fact the
UXO that is most separate from the main cluster of UXO in feature space.
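
One plausible reading of the outlier metric is sketched below in Python. The normalization is an assumption: A(i) is taken here to be the fraction of the other anomalies j for which RD(i,j) > RF(i,j), with the trivial case j = i excluded. The exact form used in the study may differ, and this sketch is not a translation of the MATLAB code in the Appendices.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_matrix(rows):
    """Symmetric matrix of correlations between all pairs of rows."""
    return [[pearson(a, b) for b in rows] for a in rows]

def outlier_metric(data, features):
    """ASSUMED form: fraction of other anomalies j with RD(i,j) > RF(i,j)."""
    rd = correlation_matrix(data)      # similarity in data (measurement) space
    rf = correlation_matrix(features)  # similarity in feature space
    n = len(data)
    return [sum(1 for j in range(n) if j != i and rd[i][j] > rf[i][j]) / (n - 1)
            for i in range(n)]

# Four toy anomalies with similar measurements; the last one has features
# that disagree with the others, as an aberrant inversion might produce.
data = [
    [1.0, 2.0, 3.0, 4.0],
    [1.1, 2.1, 2.9, 4.2],
    [0.9, 2.2, 3.1, 3.9],
    [1.0, 1.9, 3.0, 4.1],
]
features = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],
    [1.5, 3.0, 4.5],
    [3.0, 2.0, 1.0],
]
metric = outlier_metric(data, features)
```

In this toy case the fourth anomaly receives the largest value. Because both correlation matrices are symmetric, each of the other anomalies also registers one disagreement (with the fourth), so a single aberrant inversion raises every score slightly while standing out itself.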

The values of the outlier metric for the 1072 anomalies in the data set are plotted in
Figure 13 (left). Most of the anomalies have a low value of the outlier metric, indicating that the
correlation in data space rarely exceeds correlation in feature space. Selecting 0.1 as an arbitrary
threshold, 64 anomalies (6%) are above this threshold and could be further investigated as
outliers. It is worth noting that this set of flagged anomalies contains one UXO, which has a
significantly higher outlier metric than any other UXO. Looking at the two-dimensional
representation of the feature space (Figure 13, right), it can be seen that the identified UXO
is indeed the most significant UXO outlier.² Additionally, several of the outlier clutter anomalies
are also identified.

² This UXO anomaly is Master ID 1475, a 2.36 inch motor - fins, 60mm boom

Conclusions and Implications for Future Research

Recent SERDP/ESTCP projects have focused on the development of sensors that are capable of
consistently producing high-quality data. However, high-quality data cannot be fully utilized
without the appropriate follow-on signal processing and analysis. With the goal of operating
efficiently at Pd = 1, this project focused efforts on three stages present in most UXO
discrimination strategies: classifier design, feature selection, and model inversion.

Results  from  the  various  simulations  conducted  in  this  project  suggest  that  careful 
consideration  of  Pd=1  should  be  included  as  part  of  the  algorithm  selection.  This  study  provides 
evidence  that  the  desire  to  operate  at  Pd=1  may  lead  to  a  preference  for  certain  algorithms  in  the 
stages  of  the  discrimination  strategy.  In  classifier  design,  the  DLRT  classifier  consistently 
performed  well.  As  a  nonparametric,  neighborhood-based  classifier,  the  DLRT  should  be 
particularly  appropriate  for  the  Pd=1  operating  point  and  the  types  of  feature  space  that  might  be 
observed in UXO discrimination. The numerous simulations and analyses also revealed different
behaviors  by  different  types  of  classifiers  when  faced  with  UXO  identical  to  anomalies  observed 
previously,  the  effects  of  training  set  size,  and  the  ability  to  extrapolate  when  training  data  is 
mismatched  (or  sparse).  Due  to  the  complexity  of  operating  at  PD  =  1,  it  is  unlikely  that  a  single 
classifier  will  be  sufficient,  and  a  multistage  classification  approach  that  fuses  outputs  and 
decisions  from  multiple  algorithms  is  most  likely  necessary  for  robust  performance.  Evidence 
supporting  such  an  approach  can  be  found  in  the  results  presented  in  this  report.  This  study  also 
revealed  the  difficulty  in  estimating  future  performance  on  new  data  sets.  The  performance 
measures  most  commonly  used  for  empirical  selection  of  feature  sets  exhibited  very  little 
correlation  across  test  and  validation  data  sets  independently  drawn  from  the  master  set  of 
anomalies. Thus, estimating the potential performance of a feature set/classifier combination at
Pd = 1 has proven to be a difficult challenge.

The  investigations  of  sensitivity  to  feature  set  size  did  not  produce  conclusive  results  nor 
yield  any  particular  recommendations  for  either  large  or  small  feature  set  size.  However,  the 
investigation  did  reveal  avenues  of  research  that  could  be  further  explored  in  the  future.  More 
sophisticated  sensors  and  models  have  the  tendency  to  produce  larger  feature  sets,  making  the 
decision  about  how  many  features  to  use  (e.g.  6  versus  26)  a  more  relevant  question. 
Additionally,  with  the  use  of  library-matching  techniques  for  classification  and  use  of  the 
polarizability  curves  as  de  facto  “features”,  a  further  investigation  of  the  effect  of  statistical 
pattern  classification  techniques  applied  to  small  and  large  feature  sets  for  operation  at  Pd  =1 
should  be  considered.  The  previously-mentioned  suggestion  regarding  classifier  fusion  should 
also  incorporate  an  investigation  of  the  most  appropriate  feature  sets  to  be  used  in  the  different 
component  classifiers. 

The model inversion procedure is a critical stage in the data processing and a likely
source of outliers. The results of the experiments in this study suggest that analyzing
model inversion results in the context of all available anomalies may be useful for identifying
aberrant model inversions. Because the model inversion procedure is so important for
successful operation at PD = 1, it will likely benefit from additional analyses, such as those
investigated in this study, to further assess the model fits. Also, the presence of outliers produced
by the RbstMultiPrince method, and the ability to detect them based on the divergence between
data correlation and feature correlation, suggest the potential for improvements to model
inversion for operating at PD = 1.

Several of the results in this SEED project encourage further investigation, most
significantly the classifier characteristics observed over the various sets of experiments and the
potential seen in the model inversion approach. Given the uniqueness of the various classifiers
considered (Figure 6) and their performance characteristics under different conditions of
training/test data (Figure 7, Figure 9, and Figure 10), there is sufficient reason to believe that a
follow-on investigation may be able to identify a classification method that is most appropriate
for PD = 1.

Appendices


MATLAB  Code:  Model  inversion  outlier  identification  metric 

function [OIM] = OutlierIdentificationMetric(Data, Features)
% [OIM] = OutlierIdentificationMetric(Data, Features)
%
% Calculates a correlation-based measure for identifying outliers from the
% model inversion process
%
% INPUTS:
%   Data - a N x D matrix of data, where D is the number of dimensions (e.g.
%   # of time samples multiplied by # of receivers) and N is the number of
%   anomalies
%
%   Features - a N x D1 matrix of features, where D1 is the dimensionality of
%   the feature set and N is the number of anomalies as above
%
% OUTPUTS:
%   OIM - a N x 1 column of outlier identification metrics, where higher
%   values suggest a greater likelihood that the anomaly will be an outlier
%   in feature space

% calculate correlation coefficients between individual measurements
% (N x N matrix, one row/column per anomaly)
rData = corrcoef(Data.');

% calculate correlation coefficients between resulting features
% (normalized by standard deviation)
rFeats = corrcoef((Features ./ repmat(std(Features), size(Features,1), 1)).');
tempD = rData - rFeats;

% calculate fraction of anomalies for which correlation of data exceeds
% correlation of features
OIM = sum(tempD > 0).' / size(tempD, 1);
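
Read as a formula, the listing above appears to compute the following quantity for each anomaly. This is only a restatement of the code; the symbols r^data_ij and r^feat_ij (the pairwise correlation coefficients between anomalies i and j in data space and in normalized feature space) and the indicator notation are introduced here for illustration and do not appear in the report.

```latex
\mathrm{OIM}_i \;=\; \frac{1}{N} \sum_{j=1}^{N}
  \mathbf{1}\!\left[\, r^{\mathrm{data}}_{ij} > r^{\mathrm{feat}}_{ij} \,\right],
\qquad i = 1, \dots, N
```

Under this reading, OIM_i lies between 0 and 1. A value near 1 means that anomaly i resembles most other anomalies more closely in the measured data than in the inverted features, which is the divergence the metric is meant to flag.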

MATLAB Code: Distance likelihood ratio test (DLRT) classifier


function [LLRT, Yhat, PercentCorrect] = DLRTclassifier(Xtrain, Ytrain, Xtest, Ytest, K)
% [LLRT, Yhat, PercentCorrect] = DLRTclassifier(Xtrain, Ytrain, Xtest, Ytest, K)
%
% Implements the Distance Likelihood Ratio Test (DLRT) classifier.
%
% INPUTS:
%   Xtrain - a D by N matrix of training data, where D is the number of
%   dimensions and N is the number of training samples
%   Ytrain - a 1 by N vector of class labels {0,1}
%   Xtest - a D by NN matrix of test data
%   Ytest - a 1 by NN vector of class labels for the test data (if
%   available). If unavailable, input empty brackets []
%   K - number of neighbors, must be a positive integer
%
% OUTPUTS:
%   LLRT - the estimates of the likelihood ratio calculated by the DLRT
%   Yhat - estimated class labels using the minimum-error threshold
%   PercentCorrect - percentage of samples in Xtest correctly classified (if
%   "Ytest" was provided), scale 0 to 100
%
% Reference: Remus et al. "Comparison of a distance-based likelihood ratio
% test and k-nearest neighbor classification methods" Proceedings of the
% IEEE Workshop on Machine Learning in Signal Processing (MLSP) 2008

% initialize outputs
LLRT = zeros(1, size(Xtest,2));
Yhat = zeros(1, size(Xtest,2));

% find indices of the classes in the training data
classes = unique(Ytrain);
trainH1 = find(Ytrain == classes(2));
trainH0 = find(Ytrain == classes(1));

for i = 1:size(Xtest,2)
    % sorted distances from the i-th test point to all H1 training samples
    distH1 = sort(sqrt(sum((Xtrain(:,trainH1) - ...
        repmat(Xtest(:,i), 1, length(trainH1))).^2, 1)));

    % sorted distances from the i-th test point to all H0 training samples
    distH0 = sort(sqrt(sum((Xtrain(:,trainH0) - ...
        repmat(Xtest(:,i), 1, length(trainH0))).^2, 1)));

    % calculate the LLRT output from the K-th nearest-neighbor distances
    LLRT(i) = log(numel(trainH0)/numel(trainH1)) + ...
        size(Xtrain,1)*log(distH0(K)/distH1(K));
end

% estimate class labels using the minimum error threshold
Yhat(LLRT >= 0) = classes(2);
Yhat(LLRT < 0) = classes(1);

% calculate percent correct if Ytest is provided
if numel(Ytest) == size(Xtest,2)
    PercentCorrect = 100*sum(Yhat == Ytest)/length(Ytest);
else
    PercentCorrect = nan;
end
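
For reference, the statistic evaluated inside the loop of the listing above can be written compactly as follows. This is a transcription of the code rather than a derivation, and the notation is an assumption introduced here: N_0 and N_1 are the numbers of H0 and H1 training samples, D is the feature dimensionality, and Delta_K^(c)(x) is the Euclidean distance from the test point x to its K-th nearest training sample of class c.

```latex
\lambda(x) \;=\; \ln\frac{N_0}{N_1}
  \;+\; D \,\ln\frac{\Delta_K^{(0)}(x)}{\Delta_K^{(1)}(x)},
\qquad
\hat{y}(x) \;=\;
\begin{cases}
  H_1, & \lambda(x) \ge 0 \\
  H_0, & \lambda(x) < 0
\end{cases}
```

A test point that lies closer to its K-th H1 neighbor than to its K-th H0 neighbor gives a distance ratio greater than one and pushes the statistic toward H1. The zero threshold corresponds to the minimum-error decision used for Yhat; operating at PD = 1 would presumably require a lower threshold chosen from the LLRT values themselves.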